feat telegram: add voice message support for telegram with pluggable #38

Open
soufianebouaddis wants to merge 4 commits into jobrunr:main from soufianebouaddis:feature/add-voice-message-support-for-telegram

Conversation

@soufianebouaddis

Add support for voice messages in Telegram channel

This PR extends TelegramChannel to handle voice messages in addition to text.

Changes

  • Refactored TelegramChannel.consume() to process both text and voice messages
  • Added downloading of voice messages from the Telegram API
  • Introduced a pluggable SpeechToTextService abstraction
  • Routed transcribed text through the existing agent.respondTo(...) flow

Transcription

  • Default: MockSpeechToTextService (no external dependency, suitable for testing)
  • Optional: OpenAiSpeechToTextService (enabled via speech.provider=openai)
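
To make the pluggable abstraction concrete, here is a minimal sketch of how provider selection could look in plain Java. The method signature `transcribe(byte[])`, the `forProvider` helper, and the wrapper class name are assumptions for illustration only; the PR itself wires the implementations through Spring and its actual interface may differ.

```java
public class SpeechProviderSketch {

    /** Hypothetical shape of the abstraction: audio bytes in, transcript out. */
    public interface SpeechToTextService {
        String transcribe(byte[] audio);
    }

    /** Test double: returns a fixed transcript and needs no external dependency. */
    public static final class MockSpeechToTextService implements SpeechToTextService {
        @Override
        public String transcribe(byte[] audio) {
            return "mock transcription (" + audio.length + " bytes)";
        }
    }

    /** Placeholder only: a real provider would call the OpenAI transcription API here. */
    public static final class OpenAiSpeechToTextService implements SpeechToTextService {
        @Override
        public String transcribe(byte[] audio) {
            throw new UnsupportedOperationException("not implemented in this sketch");
        }
    }

    /** Mirrors the speech.provider property: "openai" selects OpenAI, anything else the mock. */
    public static SpeechToTextService forProvider(String provider) {
        if ("openai".equalsIgnoreCase(provider)) {
            return new OpenAiSpeechToTextService();
        }
        return new MockSpeechToTextService();
    }

    public static void main(String[] args) {
        SpeechToTextService stt = forProvider(System.getProperty("speech.provider", "mock"));
        System.out.println(stt.transcribe(new byte[] {1, 2, 3}));
    }
}
```

In a Spring application the same selection would typically be expressed with `@ConditionalOnProperty` on each implementation rather than a hand-written factory.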

Notes

  • Existing text message behavior remains unchanged
  • Tests updated and all passing

Next Steps / Ideas

  • Integrate real transcription providers (e.g., OpenAI Whisper, Spring AI AudioTranscriptionModel, or local Whisper plugin)
  • Open to feedback on aligning the abstraction with Spring AI’s AudioTranscriptionModel if preferred

@cla-bot

cla-bot Bot commented Mar 26, 2026

We require contributors to sign our Contributor License Agreement, and we don't have @soufianebouaddis on file. In order for us to review and merge your code, please create a PR where you add yourself to the contributors of JobRunr. This only needs to be done once. As soon as that is done, we can review your PR.

Thanks a lot!

@rdehuyss
Contributor

@cla-bot check

@cla-bot cla-bot Bot added the cla-signed label Mar 29, 2026
@cla-bot

cla-bot Bot commented Mar 29, 2026

The cla-bot has been summoned, and re-checked this pull request!

a-simeshin added a commit to a-simeshin/JavaClaw that referenced this pull request Apr 10, 2026
…with tests for the audit/executions/deliveries REST API endpoints (T58, T60) against a real PostgreSQL via Testcontainers; all tests green, BUILD SUCCESS
a-simeshin added a commit to a-simeshin/JavaClaw that referenced this pull request Apr 11, 2026
…gration role_agent_config, RoleAgentConfig entity/repository/service with hierarchy fallback and cache, model override integration through the whole pipeline (ChatRestController→SseStreamingService→ChatService), REST API endpoints for management, 26 new tests, all 1147 tests green
@auloin
Contributor

auloin commented Apr 15, 2026

Hi @soufianebouaddis, thanks for submitting this PR. Sorry for the late review.

From what I can see, the actual transcription is yet to be done. I think we should have at least one working implementation. Is this something you'd still like to work on?

My second concern is that we're mixing Telegram text and voice messages. Is it possible to find a nice abstraction?

@soufianebouaddis
Author

Hi @auloin, thanks for the review and sorry for the incomplete implementation.

I’ll continue working on this and add a concrete transcription provider so the feature works end-to-end. I’ll also revisit the current design to avoid mixing text and voice handling in TelegramChannel and introduce a cleaner abstraction for message types.

I’ll update the PR shortly with these changes. Thanks for the feedback!

@soufianebouaddis
Author

Hi @auloin, this update extends TelegramChannel.consume() to handle both text and voice inputs through a single flow. Voice messages are downloaded via TelegramVoiceDownloader, transcribed to text using a SpeechToTextService abstraction, and then passed to agent.respondTo() the same way as text messages.

I added working transcription implementations (local via whisper-cli + ffmpeg, and OpenAI), with a mock still available for testing. The flow normalizes everything to text before reaching the agent, so text and voice are no longer mixed beyond the input layer.
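
The "normalize to text at the input layer" idea described above can be sketched roughly as follows. The `Incoming`, `Text`, and `Voice` types and the `toText` helper are hypothetical names, and the downloader and transcriber are stood in for by plain functions; the real TelegramChannel, TelegramVoiceDownloader, and SpeechToTextService in the PR may be shaped differently.

```java
import java.util.function.Function;

public class VoiceFlowSketch {

    /** Hypothetical message model: the channel sees either text or a voice note. */
    public sealed interface Incoming permits Text, Voice {}

    public record Text(String body) implements Incoming {}

    /** fileId stands for whatever handle is needed to download the audio. */
    public record Voice(String fileId) implements Incoming {}

    /**
     * Reduces any incoming message to plain text so that everything past this
     * point (the agent call, the reply) only has to deal with strings.
     */
    public static String toText(Incoming message,
                                Function<String, byte[]> downloader,
                                Function<byte[], String> transcriber) {
        if (message instanceof Text text) {
            return text.body();
        }
        if (message instanceof Voice voice) {
            byte[] audio = downloader.apply(voice.fileId());
            return transcriber.apply(audio);
        }
        throw new IllegalArgumentException("Unsupported message type: " + message);
    }

    /** The agent is modeled as a simple text-to-text function for this sketch. */
    public static String respond(Incoming message,
                                 Function<String, byte[]> downloader,
                                 Function<byte[], String> transcriber,
                                 Function<String, String> agent) {
        return agent.apply(toText(message, downloader, transcriber));
    }
}
```

Keeping the branching in one small method means adding another input type later would only touch `toText`, not the agent call.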

@auloin
Contributor

auloin commented Apr 17, 2026

Thanks @soufianebouaddis. I'll review it as soon as possible. In the meantime could you already pull the main branch into your branch and solve the conflicts?

@soufianebouaddis
Author

Hi @auloin, thanks for the heads-up. I’ll pull the latest changes from main into my branch and resolve the conflicts shortly. I’ll push the updated version once everything is clean.

@soufianebouaddis soufianebouaddis force-pushed the feature/add-voice-message-support-for-telegram branch from 33d5bf0 to 21404e0 Compare April 22, 2026 18:12
@soufianebouaddis
Author

Hi @auloin,
I ran into an issue when handling voice messages from Telegram. In some cases, the LLM response becomes too large, which leads to a failure when sending the message back.

Here are the relevant logs:
2026-04-22T19:01:36.487+01:00 INFO 80461 --- [JavaClaw] [pool-4-thread-1] a.j.channels.telegram.TelegramChannel : Voice message received, downloading audio
2026-04-22T19:01:37.191+01:00 INFO 80461 --- [JavaClaw] [pool-4-thread-1] a.j.s.WhisperCppSpeechToTextService : Transcribing audio via whisper-cpp (model: /Users/snof/whisper-models/ggml-small.bin)
2026-04-22T19:01:38.542+01:00 INFO 80461 --- [JavaClaw] [pool-4-thread-1] a.j.s.WhisperCppSpeechToTextService : whisper-cpp transcription completed successfully
2026-04-22T19:01:38.543+01:00 INFO 80461 --- [JavaClaw] [pool-4-thread-1] a.j.channels.telegram.TelegramChannel : Voice message transcribed successfully
2026-04-22T19:02:24.338+01:00 WARN 80461 --- [JavaClaw] [pool-4-thread-1] a.j.channels.telegram.TelegramChannel : Failed to send HTML parsed message, falling back to raw text.

Exception in thread "pool-4-thread-1" java.lang.RuntimeException: Failed to send both HTML and fallback messages
    at ai.javaclaw.channels.telegram.TelegramChannel.sendMessage(TelegramChannel.java:136)
    at ai.javaclaw.channels.telegram.TelegramChannel.consume(TelegramChannel.java:100)
    at org.telegram.telegrambots.longpolling.util.LongPollingSingleThreadUpdateConsumer.lambda$consume$0(LongPollingSingleThreadUpdateConsumer.java:15)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
    at java.base/java.lang.Thread.run(Thread.java:1474)
Caused by: Error executing org.telegram.telegrambots.meta.api.methods.send.SendMessage query: [400] Bad Request: message is too long
    at org.telegram.telegrambots.meta.api.methods.botapimethods.PartialBotApiMethod.deserializeResponseInternal(PartialBotApiMethod.java:63)
    at org.telegram.telegrambots.meta.api.methods.botapimethods.PartialBotApiMethod.deserializeResponse(PartialBotApiMethod.java:43)
    at org.telegram.telegrambots.meta.api.methods.botapimethods.BotApiMethodMessage.deserializeResponse(BotApiMethodMessage.java:24)
    at org.telegram.telegrambots.meta.api.methods.botapimethods.BotApiMethodMessage.deserializeResponse(BotApiMethodMessage.java:17)
    at org.telegram.telegrambots.client.okhttp.OkHttpFutureCallback.onResponse(OkHttpFutureCallback.java:35)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:531)

Would you prefer that I implement message chunking to split long responses, or should we explore another approach?

@auloin
Contributor

auloin commented Apr 27, 2026

> Hi @auloin, I ran into an issue when handling voice messages from Telegram. In some cases, the LLM response becomes too large, which leads to a failure when sending the message back.
>
> Would you prefer that I implement message chunking to split long responses, or should we explore another approach?

Interesting finding @soufianebouaddis. Does it happen often? Is it a blocker? I wonder if it can be a separate task so we can keep the scope of this PR small.

Contributor

@auloin auloin left a comment

Thanks again for the work on this @soufianebouaddis. I think we're pretty close. I have a few remarks to see if we can simplify the implementation a bit.

Comment thread base/src/test/java/ai/javaclaw/speech/MockSpeechToTextService.java
Comment thread base/src/main/java/ai/javaclaw/speech/OpenAiSpeechToTextService.java Outdated

@Service
@ConditionalOnProperty(name = "speech.provider", havingValue = "whisper-cpp")
public class WhisperCppSpeechToTextService implements SpeechToTextService {
Contributor

I've been looking for a Java library that does speech to text and I found Vosk: https://github.com/alphacep/vosk-api. If it works, what do you think of making it the default, @soufianebouaddis? We could then also drop this implementation, which requires both ffmpeg and whisper-cli.

Comment thread .gitignore Outdated
@soufianebouaddis
Author

> > Hi @auloin, I ran into an issue when handling voice messages from Telegram. In some cases, the LLM response becomes too large, which leads to a failure when sending the message back.
> > Would you prefer that I implement message chunking to split long responses, or should we explore another approach?
>
> Interesting finding @soufianebouaddis. Does it happen often? Is it a blocker? I wonder if it can be a separate task so we can keep the scope of this PR small.

…/speech/MockSpeechToTextService and refactor OpenAiSpeechToTextService to delegate to Spring AI
@soufianebouaddis
Author

Hi @auloin, thanks again for the review.

I pushed a new revision addressing the first two remarks:

  • MockSpeechToTextService has been moved out of production code into src/test, since keeping it as a default runtime fallback was indeed misleading.
  • OpenAiSpeechToTextService was removed from base/ and reworked under providers/openai/ as a thin adapter over Spring AI’s AudioTranscriptionModel, which aligns better with the existing provider structure and avoids the custom HTTP/multipart handling.

I also spent some time looking into Vosk as an alternative to WhisperCppSpeechToTextService. It is definitely attractive from a portability standpoint since it removes the whisper-cli dependency, but Telegram voice messages still come in OGG format and Vosk expects WAV input, so an audio conversion step is still required unless we introduce an additional Java decoder.
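
For reference, the conversion step mentioned above could be sketched like this if ffmpeg stays in the picture. The class and method names are hypothetical, and the 16 kHz mono 16-bit PCM target is an assumption about what the chosen speech model expects; it should be checked against the model actually used.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

public class OggToWavSketch {

    /**
     * Builds the ffmpeg invocation. The flags used are standard ffmpeg options:
     * -y overwrite output, -i input file, -ac channel count, -ar sample rate,
     * -c:a audio codec. The target format (mono, 16 kHz, pcm_s16le) is an assumption.
     */
    public static List<String> ffmpegCommand(Path oggInput, Path wavOutput) {
        return List.of(
                "ffmpeg", "-y",
                "-i", oggInput.toString(),
                "-ac", "1",
                "-ar", "16000",
                "-c:a", "pcm_s16le",
                wavOutput.toString());
    }

    /** Runs ffmpeg and fails loudly when it exits with a non-zero code. */
    public static void convert(Path oggInput, Path wavOutput)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(ffmpegCommand(oggInput, wavOutput))
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // avoid blocking on a full pipe
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("ffmpeg exited with code " + exitCode);
        }
    }
}
```

This still requires ffmpeg on the host, so it only removes the whisper-cli dependency, not the external-binary requirement as a whole.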

As for the “message is too long” exception I hit during testing: it does not happen often, and it is not related to voice handling itself. It only occurs when the generated agent reply exceeds Telegram’s 4096-character limit, which can happen with regular text messages as well.
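
If message chunking is picked up as a follow-up task, a minimal sketch could look like the one below. The class and method names are hypothetical, and the sketch measures length with `String.length()` (UTF-16 code units), which is an assumption about how the limit should be counted. It also ignores HTML markup, so a split could land inside a tag when HTML parse mode is used.

```java
import java.util.ArrayList;
import java.util.List;

public class TelegramChunkerSketch {

    /** The limit quoted in this thread for a single Telegram message. */
    public static final int TELEGRAM_MAX_LENGTH = 4096;

    /**
     * Splits text into pieces of at most maxLength characters. Each cut is made
     * after the last newline in the window, else after the last space, else at
     * the hard limit. Concatenating the pieces yields the original text.
     */
    public static List<String> chunk(String text, int maxLength) {
        if (maxLength <= 0) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (text.length() - start > maxLength) {
            int hardEnd = start + maxLength;
            int cut = text.lastIndexOf('\n', hardEnd - 1);
            if (cut <= start) {
                cut = text.lastIndexOf(' ', hardEnd - 1);
            }
            // Keep the separator at the end of the current chunk.
            int end = cut > start ? cut + 1 : hardEnd;
            chunks.add(text.substring(start, end));
            start = end;
        }
        if (start < text.length()) {
            chunks.add(text.substring(start));
        }
        return chunks;
    }
}
```

A caller would then send each element of `chunk(reply, TELEGRAM_MAX_LENGTH)` as its own message, in order.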
